Appendix A and Generalization
The directional derivative of the loss function is closely related to the eigenspectrum of the mNTKs. For deep models, as noted by Hoffer et al. (2017), the weight distance from initialization grows during training. Combining Lemma 2 and Eq. 18, we find that as training iterations increase, the model's Rademacher complexity also grows as its weights deviate further from their initializations. We generally follow the settings of Liu et al. (2019) to train BERT. All VGG baselines are initialized with Kaiming initialization (He et al., 2015) and trained with SGD. Network pruning (Frankle & Carbin, 2018; Sanh et al., 2020; Liu et al., 2021) applies various criteria to remove redundant weights. MAT is the first work to employ the principal eigenvalue of the mNTK as the module-selection criterion. Table 5 compares the extended MAT, the vanilla BERT model, and SNIP (Lee et al., 2018b). In our implementation, we apply SNIP in a modular manner by calculating the connection sensitivity of each module. In contrast, using the criterion of MAT, we prune 50% of the attention heads while training the remaining ones with MAT; this leads to a further 56.7% acceleration of computation. Following Turc et al. (2019), we apply the proposed MAT to BERT models of different network scales.
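The selection criterion described above — ranking modules (e.g. attention heads) by the principal eigenvalue of their module-wise NTK — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names and the random stand-in Jacobians are assumptions, and real per-module Jacobians would come from automatic differentiation.

```python
import numpy as np

def principal_eigenvalue(jacobian: np.ndarray) -> float:
    """Largest eigenvalue of the module-wise NTK K = J J^T.

    `jacobian` has shape (n_samples, n_params): one row per input,
    holding the gradient of the module's output w.r.t. its parameters.
    """
    ntk = jacobian @ jacobian.T                 # (n_samples, n_samples), PSD
    return float(np.linalg.eigvalsh(ntk)[-1])   # eigvalsh sorts ascending

def select_modules(jacobians: dict, keep_ratio: float = 0.5) -> list:
    """Rank modules by mNTK principal eigenvalue; keep the top fraction."""
    scores = {name: principal_eigenvalue(j) for name, j in jacobians.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    n_keep = max(1, int(len(ranked) * keep_ratio))
    return ranked[:n_keep]

rng = np.random.default_rng(0)
# Toy stand-ins for per-head Jacobians; larger scale -> larger eigenvalue.
jacs = {f"head_{i}": rng.normal(scale=1.0 + i, size=(8, 32)) for i in range(4)}
kept = select_modules(jacs, keep_ratio=0.5)
```

Pruning the heads *not* in `kept` and continuing to train the rest corresponds to the 50%-pruning setup compared against SNIP above.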
DRONE: Data-aware Low-rank Compression for Large NLP Models
The representations learned by large-scale NLP models such as BERT have been widely used in various tasks. However, the increasing size of pre-trained models also brings efficiency challenges, including slow inference and large memory footprints when deploying models on mobile devices. Most operations in BERT consist of matrix multiplications, but these matrices are not low-rank, so canonical matrix decomposition cannot yield an efficient approximation. In this paper, we observe that the learned representation of each layer lies in a low-dimensional space.
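The observation above suggests why a data-aware factorization can succeed where plain SVD of the weights fails: if the inputs X lie near a k-dimensional subspace spanned by U_k, then W X ≈ (W U_k)(U_kᵀ X), and one large matmul can be replaced by two thin ones. The sketch below illustrates this idea under stated assumptions (synthetic data confined to a k-dimensional subspace); it is not DRONE's actual algorithm, and the function name is hypothetical.

```python
import numpy as np

def data_aware_lowrank(W: np.ndarray, X: np.ndarray, k: int):
    """Factor W against the top-k subspace of the data X.

    W: (d_out, d_in) weight matrix (full-rank in general).
    X: (d_in, n) activations, assumed to lie near a k-dim subspace.
    Returns A (d_out, k), B (k, d_in) with A @ (B @ X) ~= W @ X.
    """
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    Uk = U[:, :k]            # top-k left singular vectors of the data
    return W @ Uk, Uk.T      # two thin matrices replace one dense one

rng = np.random.default_rng(1)
d_out, d_in, k, n = 64, 128, 8, 256
W = rng.normal(size=(d_out, d_in))            # W itself is NOT low-rank
basis = np.linalg.qr(rng.normal(size=(d_in, k)))[0]
X = basis @ rng.normal(size=(k, n))           # data confined to k dims
A, B = data_aware_lowrank(W, X, k)
err = np.linalg.norm(W @ X - A @ (B @ X)) / np.linalg.norm(W @ X)
```

Here `err` is near zero even though a rank-8 approximation of W alone would be poor: the approximation only needs to be accurate on the subspace the data actually occupies.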